The photograph and our understanding of photography are ever changing, having transitioned from a world of unprocessed rolls of C-41 sitting in a fridge 50 years ago to sharing photos on the 1.5” screen of a point-and-shoot camera 10 years back. Today the photograph is again something different. The way we take photos is fundamentally different. We can view, share, and interact with photos on the device they were taken on. We can edit, tag, or “filter” photos directly on the camera at the same time the photo is being taken. Photos can be automatically pushed to various online sharing services, and the distinction between photos and videos has lessened. Beyond this, and more importantly, there are now lots of them. More than 250 billion photos have been uploaded to Facebook alone, which on average receives over 350 million new photos every day [B], while YouTube reports that 300 hours of video are uploaded every minute [22]. A back-of-the-envelope estimate suggests that 10% of all photos in the world were taken in the last 12 months, and that was calculated more than three years ago [8].

Today, a large number of the digital media objects that are shared have been uploaded to services like Flickr or Instagram, which along with their metadata and their social ecosystem form a vibrant environment for finding solutions to many research questions at scale. Photos and videos provide a wealth of information about the universe, covering entertainment, travel, personal records, and various other aspects of life in general as it was when they were taken. Considered collectively, they represent knowledge that goes beyond what is captured in any individual snapshot and provide information on trends, evidence of phenomena or events, social context, and societal dynamics. Consequently, collections of media are useful for qualitative and quantitative empirical research in many domains. However, scientific endeavors in fields like social computing and computer vision have generally relied on independently collected multimedia datasets, which complicates research growth and synergy. There is a need for a more substantial dataset for researchers, engineers, and scientists around the globe.

To meet the call for scale, openness, and diversity in academic datasets, we take the opportunity in this article to present a new multimedia dataset containing 100 million media objects and explain the rationale behind its creation. We discuss the implications it has for science, research, engineering and development, and demonstrate its usefulness towards tackling a broad range of problems in various domains. With the release of our dataset comes the opportunity to advance research, giving rise to new challenges and solving existing ones.

1. SHARING DATASETS 

Datasets are critical for research and exploration as, rather obviously, data is required for performing experiments, validating hypotheses, analyzing designs, and building applications. Over the years a plurality of multimedia datasets have been put together for research and development. In Table 1 we present a summary of the most popular datasets at present. Most of them, however, cannot truly be called multimedia datasets because they contain only a single type of media, rather than a mixture of modalities such as photos, videos, and audio. Datasets range from the one-off instances that have been exclusively created for supporting the work presented in a single paper or demo (‘short-term datasets’) to those that have been created with multiple related or separate endeavors in mind (‘long-term datasets’). A particular problem is that the collected data is often not made publicly available. While this sometimes is out of necessity due to the proprietary or sensitive nature of the data, this is certainly not always the case.

The question of sharing data for replication and growth 
has arisen several times in the past 30 years alone 013 ESI 



Table 1: Popular multimedia datasets used over the years by the research community. When various versions of a particular collection were available, we generally only included the most recent one. PASCAL, TRECVID, MediaEval and ImageCLEF are/were yearly recurring benchmarks that consist of one or more challenges, each of which has its own dataset; in the table we report the total number of media objects aggregated over all datasets that are part of the most recent edition of each benchmark. The icons used in the License, Accessibility and Content columns have the following meanings: some or all content in the dataset is copyrighted; all content in the dataset has a Creative Commons license; content in the dataset can be freely used on condition of citing the dataset paper; the dataset has to be downloaded; the dataset has to be purchased; the dataset is delivered by mail; the dataset can only be obtained by accepting a license agreement; the dataset can only be obtained after creating an account; the dataset can only be obtained by participating in a benchmark competition; the dataset contains URLs to the content instead of the content itself; the dataset is fully/partially annotated with class labels; the dataset contains content found by querying search engines with dictionary words; the dataset contains generated features; the dataset contains subtitles, transcripts, or captions describing the content; the dataset contains search engine click log data; the dataset contains user information; the dataset contains camera information; the dataset contains timestamps; the dataset contains locations; the dataset contains tags; the dataset contains object bounding boxes; the dataset contains object segmentations; the dataset is still evolving; the dataset changes from year to year.


Year  Dataset           Type       Image  Video  Audio
1966  Brodatz           texture    <1K    -      -
1996  COIL-100          object     7K     -      -
1996  Corel             stock      60K    -      -
2000  FERET             face       14K    -      -
2005  Yale Face B+      face       16K    -      -
2005  Ponce             texture    1K     -      -
2007  Caltech-256       object     30K    -      -
2007  Oxford            buildings  5K     -      -
2008  CMU Multi-PIE     face       750K   -      -
2008  Tiny Images       web        80M    -      -
2008  MIRFLICKR-25K     Flickr     25K    -      -
2009  NUS-WIDE          Flickr     270K   -      -
2009  ImageNet          web        14M    -      -
2010  SUN               web        131K   -      -
2010  MIRFLICKR-1M      Flickr     1M     -      -
2012  PASCAL*           Flickr     23K    -      -
2013  MS Clickture**    web        40M    -      -
2014  Sports-1M***      sports     -      1M     -
2014  MS COCO           Flickr     330K   -      -
2014  YFCC100M****      Flickr     99M    800K   -
2015  TRECVID           mixed      -      220K   -
2015  MediaEval*****    mixed      6M     51K    1.4K
2015  ImageCLEF         mixed      500K   -      -

[License, Accessibility and Content columns: icons not recoverable]


* The PASCAL training and development data can be freely downloaded, but the test data requires registration.
** Reduced resolution images are included in the dataset, while full resolution images need to be downloaded separately.
*** The Sports-1M dataset has a CC license, although the videos it links to are hosted on YouTube and copyrighted.
**** The photos and videos are currently being uploaded to the cloud and will be available for download afterwards.
***** Most of the MediaEval challenges include the media objects in their dataset, but some provide only URLs. In previous editions data had to be purchased and delivered by mail in order to participate in certain challenges.


and has been brought into discussion in ACM’s SIGCHI [20]. This “sharing discussion” reveals many of the underlying complexities with sharing, both with regard to the data (e.g. what exactly is considered data) and to the sharing point of view (e.g. incentives and disincentives for doing so). For example, one might be reluctant to share data freely, as it has a value from the often substantial amount of time, effort, and money that was invested in collecting it.

Another barrier to sharing arises when data is harvested for research under a general umbrella of “academic fair use” without regard to its licensing terms. Beyond the legal corporate issues, this clearly may violate the copyright of the owner of the data, which on many User-Generated Content (UGC) sites like Flickr remains with the creator. The Creative Commons (CC), a nonprofit organization that was founded in 2001, seeks to build a rich public domain of “some rights reserved” media, sometimes referred to as the copyleft movement. Their licenses allow media owners to communicate how they would like their media to be rights reserved. For example, an owner can indicate that a photo may be used only for non-commercial purposes or whether someone is allowed to remix it or turn it into a collage. Depending on how the licensing options are chosen, CC licenses can be applied that are more restrictive (e.g. CC Attribution-NonCommercial-NoDerivs or CC-BY-NC-ND license) or less restrictive (e.g. CC Attribution-ShareAlike or CC-BY-SA) in nature. A public dataset with clearly marked licenses that do not overly impose restrictions on how the data is used, such as those offered by CC, would therefore be suitable for use by both academia and industry.

We underscore that the importance of sharing, and perhaps even its principal argument, is that it ensures data equality for research. While the availability of data alone may not necessarily be sufficient for the exact reproduction of scientific results (since the original experimental conditions also would need to be replicated as closely as possible, which may not always be possible), research should start with publicly sharable and legally usable data that is flexible and rich enough to promote new advancements, rather than with data that only serves as a one-time collection for a specific task and that cannot be shared. Shared datasets can play a singular role in achieving research growth and in facilitating synergy within the research community that is otherwise difficult to attain.

2. THE YFCC100M DATASET 

We created the Yahoo Flickr Creative Commons 100 Million Dataset¹ (YFCC100M) as part of the Yahoo Webscope program. This dataset is the largest public multimedia collection that has ever been released, comprising a total of 100 million media objects, of which approximately 99.2 million are photos and 0.8 million are videos, all of which were uploaded to Flickr between 2004 and 2014 and published under a CC commercial or non-commercial license. The dataset is currently distributed via Amazon AWS as a 12.5GB compressed archive containing only metadata. However, as is the case with many datasets, the YFCC100M is constantly evolving, and over time we will release various expansion packs containing data not yet present in the collection. For instance, several visual and aural features extracted from the data have already been made available². The actual photos and videos are currently being uploaded to the cloud to ensure the dataset will remain accessible and intact for years to come; like the metadata, the photo and video data can then be mounted as a read-only network drive. Our dataset overcomes many of the issues affecting existing datasets, for instance in terms of modalities, metadata, licensing, and principally volume; we discuss the strengths and limitations of our collection in more detail below.

2.1 Metadata 

Each media object included in the dataset is represented by its metadata in the form of its Flickr identifier, the user that created it, the camera that took it, the time at which it was taken and when it was uploaded, the location where it was taken (if available), and the CC license it was published under. In addition, the title, description and tags are also available, as well as direct links to its page and its content on Flickr. Social features, comments, favorites, and followers/following data are not included in the dataset as, by their nature, these change on a day-to-day basis. They can, however, be easily obtained by querying the Flickr API³, and over time we will make snapshots of such social features available for download. We are further working towards releasing the Exif metadata of the photos and videos as an expansion pack.
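As a rough illustration of how such a record might be held in memory, the Python sketch below parses one line of metadata into a typed structure. The tab-separated layout, the column order, and all field names are assumptions made for this example; the description above lists which kinds of fields exist, not how they are encoded.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MediaRecord:
    """One media object; field names are hypothetical and mirror the prose."""
    flickr_id: str
    user: str
    camera: Optional[str]
    date_taken: str
    date_uploaded: str
    longitude: Optional[float]
    latitude: Optional[float]
    license: str
    title: str
    description: str
    tags: List[str]
    page_url: str
    content_url: str
    is_video: bool


def parse_record(line: str) -> MediaRecord:
    """Parse one tab-separated line (assumed layout with 14 columns)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 14:
        raise ValueError(f"expected 14 columns, got {len(fields)}")

    def opt_float(text: str) -> Optional[float]:
        # A location is only present for geotagged media.
        return float(text) if text else None

    return MediaRecord(
        flickr_id=fields[0],
        user=fields[1],
        camera=fields[2] or None,
        date_taken=fields[3],
        date_uploaded=fields[4],
        longitude=opt_float(fields[5]),
        latitude=opt_float(fields[6]),
        license=fields[7],
        title=fields[8],
        description=fields[9],
        # Assumption: tags are stored as one comma-separated string.
        tags=fields[10].split(",") if fields[10] else [],
        page_url=fields[11],
        content_url=fields[12],
        is_video=fields[13] == "1",
    )
```

Keeping the optional fields (camera, location) as `None` rather than empty strings makes it harder to mistake a missing geotag for the coordinate 0, 0 in later analysis.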

Tags. There are 68,552,616 photos and 418,507 videos in the dataset that have been annotated with user tags (or keywords). The tags make for a rich and diverse set of entities related to people (baby, family), animals (cat, dog), locations (park, beach), travel (nature, city), to name just the top few. A total of 3,343,487 photos and 7,281 videos carry machine tags (labels that have been automatically added by a camera, computer, application, or some other automated system).

¹ The dataset is available at https://bit.ly/yfcc100md
² The features are available at https://bit.ly/yfcc100mf
³ https://www.flickr.com/services/api/

[Plot: captured and uploaded media objects over time]

Figure 1: Number of captured and uploaded media objects per month in the YFCC100M dataset during the period 2000-2014. The number of uploads closely follows the number of captures, with uploads in more recent years even exceeding captures as older media is also uploaded.

Figure 2: Global coverage of a sample of one million photos from the YFCC100M dataset. One Million Creative Commons Geo-tagged Photos by David Shamma (CC license), https://flic.kr/p/o1Ao2o

Timespan. Although our dataset contains media uploaded between the inception of Flickr in 2004 and the creation of the dataset in 2014, the actual time span during which they were captured is much longer. Some scans of books and newspapers have even been backdated to the early 19th century when they were originally published. However, we note that camera clocks are not always set to the correct date and time, and some photos and videos even erroneously report they were captured in the distant past or in the future. Figure 1 plots the moments of capture and upload of photos and videos during the period 2000-2014, which accounts for 99.6% of the media objects in the dataset.
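Because capture times come from camera clocks, an analysis that depends on them may want to flag suspicious values first. The Python sketch below shows one possible pair of checks. The function names, the one-day tolerance, and the choice to reuse the 2000-2014 window of Figure 1 are assumptions for illustration, not rules that were applied when the dataset was built.

```python
from datetime import datetime, timedelta


def in_plot_window(taken: datetime, first_year: int = 2000, last_year: int = 2014) -> bool:
    """True if the capture year falls inside the period plotted in Figure 1."""
    return first_year <= taken.year <= last_year


def captured_after_upload(
    taken: datetime,
    uploaded: datetime,
    tolerance: timedelta = timedelta(days=1),
) -> bool:
    """True if the reported capture time is later than the upload time.

    A small tolerance (assumed here to be one day) absorbs differences
    between local camera time and server time.
    """
    return taken - uploaded > tolerance
```

Note that a capture date far in the past is not necessarily an error, since scans may have been deliberately backdated; a capture time well after the upload time, on the other hand, can only come from a wrong clock.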

Locations. There are 48,366,323 photos and 103,506 videos in the dataset that have been annotated with a geographic coordinate, either manually by the user or automatically via GPS. The cities in which more than 10,000 unique users captured media are London, Paris, Tokyo, New York, San Francisco, and Hong Kong. Overall, the dataset spans 249 different territories (countries, islands, etc.) in the world, and also includes photos and videos taken in international waters and airspace; see Figure 2.







Table 2: Top 25 cameras and photo counts in the YFCC100M dataset. We have merged the entries for the Canon models that have different names in the European (e.g. EOS 650D), American (e.g. EOS Rebel T4i) and Asian (e.g. EOS Kiss X6i) markets.

Make   Model            Photos
Canon  EOS 400D         2,539,571
Canon  EOS 350D         2,140,722
Nikon  D90              1,998,637
Canon  EOS 5D Mark II   1,896,219
Nikon  D80              1,719,045
Canon  EOS 7D           1,526,158
Canon  EOS 450D         1,509,334
Nikon  D40              1,358,791
Canon  EOS 40D          1,334,891
Canon  EOS 550D         1,175,229
Nikon  D7000            1,068,591
Nikon  D300             1,053,745
Nikon  D50              1,032,019
Canon  EOS 500D         1,031,044
Nikon  D700             942,806
Apple  iPhone 4         922,675
Nikon  D200             919,688
Canon  EOS 20D          843,133
Canon  EOS 50D          831,570
Canon  EOS 30D          820,838
Canon  EOS 60D          772,700
Apple  iPhone 4S        761,231
Apple  iPhone           743,735
Nikon  D70              742,591
Canon  EOS 5D           699,381


Cameras. Table 2 shows that the top 25 cameras used in the dataset are overwhelmingly digital single lens reflex (DSLR) models, with the exception of the Apple iPhone. Considering that the most popular cameras in the Flickr community at the moment primarily consist of various iPhone models⁴, this bias in our data is likely due to CC licenses attracting a certain subcommunity of photographers that differs from the overall Flickr user base.

Licenses. The licenses themselves vary by CC type, with approximately 31.8% of the dataset marked appropriate for commercial use and 17.3% having been assigned the most liberal license, which only requires the photographer that took the photo to be attributed; see Table 3.
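The two percentages can be reproduced from the per-license counts in Table 3. The short Python check below is a sketch: the license abbreviations used as dictionary keys are assumed labels for the six rows of that table, with the three licenses lacking a Non-Commercial clause counted as suitable for commercial use.

```python
# (photos, videos) per license, copied from Table 3.
# The abbreviations are assumed labels for the table rows.
LICENSE_COUNTS = {
    "BY": (17_210_144, 137_503),
    "BY-SA": (9_408_154, 72_116),
    "BY-ND": (4_910_766, 37_542),
    "BY-NC": (12_674_885, 102_288),
    "BY-NC-SA": (28_776_835, 235_319),
    "BY-NC-ND": (26_225_780, 208_668),
}


def share(licenses):
    """Percentage of all media objects published under the given licenses."""
    total = sum(p + v for p, v in LICENSE_COUNTS.values())
    selected = sum(sum(LICENSE_COUNTS[name]) for name in licenses)
    return 100.0 * selected / total


attribution_only = share(["BY"])
commercial_use = share(["BY", "BY-SA", "BY-ND"])
print(f"attribution only: {attribution_only:.1f}%")
print(f"commercial use allowed: {commercial_use:.1f}%")
```

The counts sum to exactly 100,000,000 media objects, so the shares are simply the selected counts divided by one million.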


2.2 Content 

⁴ https://www.flickr.com/cameras/

Table 3: A breakdown of the 100 million photos and videos by their kind of Creative Commons license: Attribution (BY), No Derivatives (ND), Share Alike (SA), and Non-Commercial (NC).

License    Photos       Videos
BY         17,210,144   137,503
BY-SA      9,408,154    72,116
BY-ND      4,910,766    37,542
BY-NC      12,674,885   102,288
BY-NC-SA   28,776,835   235,319
BY-NC-ND   26,225,780   208,668
Total      99,206,564   793,436

The dataset includes a diverse collection of complex real world scenes, ranging from 200,000 street life-blogged photos by photographer Andy Nystrom, Figure 3(a), to snapshots of day-to-day life, holidays and events, Figure 3(b). To understand more about the visual content represented in the dataset, we used a deep learning approach to find the presence of a variety of concepts, such as people, animals, objects, food, events, architecture, and scenery. Specifically, we applied an off-the-shelf deep convolutional neural network [13] with 7 hidden layers (5 convolutional layers and 2 fully connected ones). The penultimate layer of the convolutional neural network output was employed as the image feature representation to train the visual concept classifiers. We used Caffe to train 1,570 classifiers, each being a binary SVM, using 15 million photos taken from the entire Flickr corpus; positive examples were crowd labeled or handpicked based on targeted search/group results, while negative examples were drawn from a general pool. We tuned the classifiers such that they achieved at least 90% precision on a held-out test set. In Table 4 we show the top 25 detected concepts in both photos and videos (using the first frame). We see a diverse collection of visual concepts being detected, from outdoor to indoor images, sports to art, nature to architecture. As we realize the detected visual concepts may be valuable to the community, we recently released them as one of our expansion packs.
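The text does not say how the classifiers were tuned to the 90% precision target. One common way to do this, shown below as a sketch under that assumption, is to pick for each classifier the lowest decision threshold at which precision on the held-out set still meets the target. The function name and the input format (parallel lists of scores and 0/1 labels) are hypothetical.

```python
from typing import Optional, Sequence


def lowest_threshold_for_precision(
    scores: Sequence[float],
    labels: Sequence[int],
    target: float = 0.90,
) -> Optional[float]:
    """Return the lowest score threshold whose precision is >= target.

    An item is predicted positive when its score is >= the threshold.
    Returns None when no threshold reaches the target precision.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best = None
    true_positives = 0
    for count, (score, label) in enumerate(ranked, start=1):
        true_positives += label
        # Only cut between distinct scores, so tied items stay together.
        last_of_tie = count == len(ranked) or ranked[count][0] != score
        if last_of_tie and true_positives / count >= target:
            best = score
    return best
```

Choosing the lowest qualifying threshold keeps recall as high as possible while honoring the precision constraint; a stricter variant could stop at the first threshold where precision drops below the target.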

Flickr makes little distinction between a photo and a video; however, videos do play a role both on Flickr and in this dataset. While photos encode their content primarily through visual means, videos also convey this through audio and motion. Only 5% of the videos in our dataset do not have an audio track. From a manual examination of over 120 randomly selected geotagged videos having audio, we found that their audio tracks were very diverse. Namely, 60% of the videos were home-video style with little ambient noise; 47% of the videos had heavy ambient noise such as people chatting in the background, sound of traffic, and wind blowing into the microphone; 25% of the sampled videos contained music, either played in the background of the recorded scene or inserted at the editing phase; 60% of the videos did not contain any form of human speech at all, and of the ones that did contain human speech, 64% were from multiple subjects and crowds in the background speaking to one another, often at the same time. The vocabulary of approximately 280,000 distinct user tags used for the video annotations indeed shows that those describing audio content (music, concert, festival) and motion content (timelapse, dance, animation) are more frequently applied to videos than to photos. When we compare the videos in the dataset to those taken from YouTube in the time period 2007-2012, we note that YouTube videos are on average longer (Flickr: 39 sec., YouTube: 214 sec.). This is likely due to the initial handling of videos on Flickr, where their length until recently was restricted to a maximum of 90 seconds; recent videos uploaded to Flickr indeed tend to be longer.

2.3 Representativeness 

To create the dataset, we did not perform any specific filtering besides excluding photos and videos that had been marked as ‘screenshot’ or ‘other’ by the Flickr user. We did, however, include as many videos as possible, because videos form a small percentage of media uploaded to Flickr and a random selection would have led to relatively few videos being selected. We further included as many photos as possible that were associated with a geographic coordinate to encourage spatiotemporal research. Together these photos and videos form approximately half of the dataset, and the remainder is composed of CC photos randomly selected from the entire pool of photos on Flickr.

(a) IMG-9793: Streetcar (Toronto Transit) by Andy Nystrom (CC license), https://flic.kr/p/jciMdz
(b) Celebrating our 6th wedding anniversary in Villa Mary by Rita & Tomek (CC license), https://flic.kr/p/fCXEJi

Figure 3: Two photos of real world scenes from photographers in the YFCC100M dataset.

To investigate whether our dataset contains a representative sample of real world photography, we collected an additional random sample of 100 million public Flickr photos and videos, irrespective of their license, that were uploaded during the same time period as those in our dataset. We then compared the relative frequency with which content and metadata were present in our dataset and in the random sample. We found that the average difference in relative frequencies between two corresponding visual concepts, cameras, timestamps (year and month) and locations (countries) was only 0.02%, with an overall standard deviation of 0.1%. The maximum difference we observed was 3.5%, due to more videos in the YFCC100M having been captured in the United States than in the random sample (46.2% vs. 42.7%). While we found little individual difference between the relative frequency of use of any two corresponding cameras in our dataset and in the random sample (at most 0.5%), we did observe the earlier mentioned tendency towards more professional DSLR cameras in the YFCC100M rather than regular point-and-shoot cameras. This notwithstanding, our dataset appears overall to exhibit similar characteristics to photos and videos present in the entire Flickr corpus.
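The comparison described above can be sketched in a few lines of Python. The code below computes, for one attribute such as country or camera, the relative frequency of each value in two samples and summarizes the gaps between them. The function names are hypothetical, and the use of absolute differences and of the population standard deviation are assumptions, since the text does not specify either.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import Dict, Hashable, Iterable, Tuple


def relative_frequencies(values: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Map each distinct value to its share of the sample."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot compute frequencies of an empty sample")
    return {key: count / total for key, count in counts.items()}


def frequency_gaps(ours: Iterable[Hashable], reference: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Absolute difference in relative frequency for every value seen in either sample."""
    freq_ours = relative_frequencies(ours)
    freq_ref = relative_frequencies(reference)
    keys = set(freq_ours) | set(freq_ref)
    return {key: abs(freq_ours.get(key, 0.0) - freq_ref.get(key, 0.0)) for key in keys}


def summarize(gaps: Dict[Hashable, float]) -> Tuple[float, float, float]:
    """Return (mean, standard deviation, maximum) of the gaps."""
    values = list(gaps.values())
    return mean(values), pstdev(values), max(values)
```

For instance, a country that accounts for 46.2% of one sample and 42.7% of the other yields a gap of 0.035, that is, 3.5 percentage points, which matches the maximum difference reported for videos captured in the United States.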

2.4 Features and Annotations 

Computing features for 100 million media items is a time-consuming and computationally expensive task. Not everyone has access to a distributed computing cluster, and just performing light processing of all the photos and videos on a single desktop machine may already take several days. From our experience with organizing benchmarks on image annotation and location estimation, we noticed that accompanying the datasets with precomputed features reduced the burden on the participating teams, allowing them to focus on solving the task at hand rather than on processing the data. As mentioned earlier, we are currently computing a wide variety of visual, aural, textual and motion features for the dataset, and have already released several of them. The visual features span the gamut of global (e.g. Gist), local (e.g. SIFT) and texture (e.g. Gabor) descriptors, the aural features include power spectrum (e.g. MFCC) and frequency (e.g. Kaldi) descriptors, the textual features refer to closed captions extracted from the videos, and the motion features include dense trajectories and shot boundaries. These features, being computed descriptors of the photos and videos, will be licensed without any restrictions under the CC0 license. Real world data does not have well-formed annotations, which presents the sensemaking of the dataset itself as an area of investigation. Annotations such as bounding boxes, segmentations of objects and faces, and image captions are not yet available for the YFCC100M, although generating and releasing them is on our roadmap.

2.5 Ecosystem 

Our dataset has already given rise to an ecosystem of diverse challenges and benchmarks, similar to how ImageNet, PASCAL and TRECVID have been used by the community. For example, the MediaEval Placing Task [3], an annual benchmark in which participants develop algorithms for estimating the geographic location where a photo or video was taken, is currently based on our dataset. To support research in multimedia event detection, the YLI-MED corpus [1] was recently introduced, which consists of 50,000 handpicked videos from the YFCC100M that belong to events similar to those defined in the TRECVID MED challenge. Approximately 2,000 videos were categorized as depicting one of ten target events, and 48,000 as belonging to none of these events. Each video was further annotated with additional attributes like language spoken and whether the video has a musical score. The annotations also include degree of annotator agreement and average annotator confidence scores for the event categorization of each video. The authors stated that the main motivation for the creation of the YLI-MED corpus was to provide an open set without the license restrictions imposed on the original TRECVID MED dataset, while it could also serve, for instance, as additional annotated data to improve the performance of current event detectors. Other venues in which our dataset currently features are the ACM Multimedia 2015 Grand Challenge on Event Detection and Summarization, as well as the ACM Multimedia 2015 MMCommons Workshop; the latter kicks off the development of a research community around annotating all 100 million photos and videos. Certainly, the utility of the dataset will grow as more features and annotations are produced and shared, whether by ourselves or by others.

Table 4: The top 25 of 1,570 visually detected concepts in the YFCC100M dataset. Photos and videos are counted as many times as they contain visual concepts.

Concept        Photos       Videos
outdoor        44,290,738   266,441
indoor         14,013,888   127,387
people         11,326,711   56,664
nature         9,905,587    47,703
architecture   6,062,789    11,289
landscape      5,121,604    28,222
monochrome     4,477,368    18,243
sport          4,354,325    25,129
building       4,174,579    7,693
vehicle        3,869,095    13,737
plant          3,591,128    11,815
blackandwhite  2,585,474    10,351
animal         2,317,462    9,236
groupshot      2,271,390    4,392
sky            2,232,121    11,488
water          2,089,110    15,426
text           2,074,831    5,623
road           1,796,742    12,808
blue           1,658,929    10,273
tree           1,641,696    6,808
hill           1,448,925    6,075
shore          1,439,950    8,602
car            1,441,876    4,067
head           1,386,667    8,984
art            1,391,386    2,248

2.6 Strengths and Limitations 

We specifically note the following strengths (+) and limitations (-) of the YFCC100M dataset:

(+) Design. Our dataset differs in its design from most multimedia collections. The collection of photos, videos, and metadata in our dataset has been curated to be comprehensive and representative of real world photography, expansive and expandable in coverage, and free and legal to use, and as such intends to consolidate and supplant many of the existing datasets currently in use. We emphasize that our dataset does not challenge collections that are different and unique (e.g. PASCAL, TRECVID, ImageNet, COCO); instead, we aspire to make it the preferred choice for researchers, developers and engineers with small and large multimedia needs that can be easily satisfied by our dataset, rather than having them needlessly collect their own data.

(+) Equality. The dataset ensures data equality for research, which facilitates reproducing, verifying and extending scientific experiments and results.

(+) Volume. The dataset, spanning 100 million media objects, is the largest public multimedia collection to have ever been released.

(+) Modalities. In contrast with most existing collections, our dataset includes both photos and videos, making it a truly multimodal multimedia collection.

(+) Metadata. Each media item is represented by a substantial amount of metadata. The dataset includes metadata that is often absent in existing datasets, such as machine tags, geotags, timestamps, and cameras. While social metadata is not included due to its ever-changing nature, it can be easily obtained by querying the Flickr API.

(+) Licensing. The vast majority of available datasets include media whose licenses do not allow them to be used without explicit permission from the rights holder. While fair use exceptions may be invoked depending on the nature of use, these will generally not be applicable to research and development performed by industry and/or for commercial gain. For example, a university spin-off offering a mobile product recognition application that displays matching ImageNet images for each detected product will not only violate the ImageNet license agreement, but also very likely copyright law. Our dataset overcomes this issue by providing rules on how the dataset should be used to comply with licensing, attribution and copyright.

(-) Annotations. Our collection at present reflects the data as it is in the wild; there are lots of photos and videos, but they are currently associated with limited metadata and annotations. We note that our dataset may not and/or cannot offer every single kind of content, metadata and annotation present in existing collections (e.g. object segmentations as in COCO, broadcast videos as in TRECVID), although we believe that our efforts and those that spring from the ecosystem currently being constructed around the dataset will offer more depth and richness to the existing content, as well as new material, making it more complete and useful over time. While a lack of annotations clearly forms a limitation of our dataset, we also see this as a challenge. At the scale of 100 million media objects there are enough metadata labels for training and prediction of some attributes, e.g. geographic coordinates, and there exist opportunities to create new methods for labeling and annotation through explicit crowdsourced work or through more modern techniques using social community behaviors. In addition, given the plurality of existing Flickr datasets and the size of our dataset, some overlaps are to be expected, such that existing annotations directly transfer over to our dataset. Of particular note is COCO, of which a third (about 100,000 images) is present in the YFCC100M. Over time we will also release the intersections with known datasets as expansion packs. We envision that the intersection with existing datasets allows researchers to expand upon what is known and actively researched.

2.7 Guidelines and Recommendations 

Although we listed volume as a strength of our dataset, it
can also be a weakness when insufficient computational
power, memory and/or storage is available. The compressed
metadata of all 100 million media objects requires 12.5GB
of hard disk space and, at the default pixel resolution used
on Flickr, the photos take up approximately 13.5TB and the
videos 3.0TB. While the entire dataset (metadata and/or
content) can be processed
in a matter of minutes to hours on a distributed computing 
cluster, it may take a few hours to days to do on a single 
machine. Naturally, the YFCC100M can still be used for ex¬ 
periments by only focusing on a subset of the data. Also, 
different fields of research, engineering and science have dif¬ 
ferent data requirements and evaluation needs, and all 100 
million media objects in the dataset are not likely to be 
needed for each and every study. We do note that in the 
computer science literature it is uncommon for a paper to 
describe in enough detail how the dataset they used in their 
evaluations was created, effectively preventing others from 
faithfully replicating or comparing against the achieved re¬ 
sults. One clear challenge for the future is ensuring that 
subsets of the dataset used in experiments can be accurately 
reproduced. To this end, we suggest that researchers forgo
arbitrary selections from our dataset when forming a subset
for use in their evaluations, and instead use a principled
approach that can be succinctly described. Such selection
logic should take into account one or both of the following
two aspects of the dataset, namely that (i) the photos and
videos in the dataset are already randomized, and (ii) the
dataset consists of 10 consecutively numbered files. As such, a
selection logic could be as simple as

“We used the videos in the first four metadata
files for training, those in the following three files
for development, and those in the last three for
testing.”

or more complicated, such as

“From all photos taken in the United States, we
selected the first 5 million and performed 10-fold
cross-validation.”
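
The two example rules quoted above can be written down as code so that the description in a paper and the subset actually used cannot drift apart. The following is a minimal sketch only: the function names are hypothetical, and it assumes the 10 metadata files are numbered 0 through 9 (adjust `first_number` if the numbering starts elsewhere). Because the dataset is already randomized, assigning folds in round-robin order is assumed to be sufficient.

```python
from typing import Iterable, Iterator, Tuple

NUM_FILES = 10  # the dataset consists of 10 consecutively numbered files


def split_for_file(file_number: int, first_number: int = 0) -> str:
    """Return 'train', 'dev' or 'test' for a metadata file number.

    first_number is the number of the first file; 0 is an assumption.
    """
    position = file_number - first_number
    if not 0 <= position < NUM_FILES:
        raise ValueError(f"file number {file_number} is out of range")
    if position < 4:   # first four files
        return "train"
    if position < 7:   # following three files
        return "dev"
    return "test"      # last three files


def assign_folds(
    object_ids: Iterable[str], limit: int, folds: int = 10
) -> Iterator[Tuple[str, int]]:
    """Take the first `limit` identifiers in dataset order and
    assign each one a cross-validation fold in round-robin order."""
    for position, object_id in enumerate(object_ids):
        if position >= limit:
            break
        yield object_id, position % folds
```

For the second rule, `object_ids` would be the identifiers of the photos that pass the chosen filter (for example, taken in the United States), read in the order in which they appear in the dataset.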

Alternatively, the created subset can be made available for
download as a set of object identifiers that index into the
dataset. As an example, the organizers of the MediaEval
Placing Task made available for download, in addition to the
training and test sets, the visual and aural features they
extracted from the content. We envision that the research
community will also adopt this way of using and sharing the
dataset.
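
When a subset is shared as a list of object identifiers, publishing a checksum alongside it lets others confirm that they are working with exactly the same subset. The sketch below is one possible way to do this and is not a procedure prescribed by the dataset; the function name and the one-identifier-per-line layout are assumptions made for illustration.

```python
import hashlib
from typing import Iterable, Tuple


def subset_manifest(identifiers: Iterable[object]) -> Tuple[str, str]:
    """Build a canonical identifier list and its SHA-256 digest.

    Identifiers are converted to strings, de-duplicated and sorted as
    strings, so the same subset always yields the same text and digest
    regardless of the order in which it was collected.
    """
    lines = sorted({str(identifier) for identifier in identifiers})
    text = "\n".join(lines) + "\n"
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return text, digest
```

The returned text can be written to a file and offered for download, with the digest quoted in the accompanying paper.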

2.8 Future Directions 

Our dataset provides opportunities for large-scale unsuper¬ 
vised learning, semi-supervised learning and learning with 
noisy data. Beyond this, the YFCC100M offers the oppor¬ 
tunity to advance research, give rise to new challenges and 
solve existing ones. For instance, we note the following chal¬ 
lenges in which our dataset may play a leading role: 

AI and Vision. Large datasets have played a critical role 
in advancing computer vision research. ImageNet [5] itself
led the way to the development of advanced machine
learning techniques like deep learning. Computer vision
as a field now seeks to do visual recognition by learning and
benchmarking from large-scale data. The YFCC100M data¬ 
set provides new opportunities for this effort by developing 
new approaches that harness more than just the pixels. For 
instance, new semantic connections can be made by inferring 
them through relating groups of visually co-occurring tags
in images depicting similar scenes, where such efforts
hitherto have been hampered by the lack of a sufficiently large
dataset. Our expansion pack containing the detected visual 
concepts in the photos and videos can assist with address¬ 
ing this challenge. However, visual recognition goes beyond 
image classification and involves obtaining a deeper
understanding of images. Object localization, object interaction,
face recognition and image annotation are all important
cornerstone challenges that will lead the way to retelling the
story of an image: what is happening in the image and why
this image was taken. Flickr, having a rich and diverse collection
of image types, provides the groundwork for total scene
understanding in computer vision and artificial intelligence,
a crucial task that can be expanded upon with our dataset, 
and even more so once additional annotations are released 
by ourselves and by others. 

Spatiotemporal Computing. Beyond pixels, we empha¬ 
size time and location information as key components for re¬ 
search that aims at understanding and summarizing events. 
Starting with location, the geotagged photos and videos, 
about half of the dataset, provide a comprehensive snap¬ 
shot of photographic activity over space and time. In the 
past, geotagged media has been used to address a variety
of research endeavors, such as location estimation [9], event
detection, finding canonical views of places, and visual
reconstruction of the world. Even an understanding of
styles and habits has been used to reverse lookup authors
online. Geotagged media in the YFCC100M dataset can
help push the boundaries in these research areas even fur¬ 
ther. 

More data brings new discoveries and insights, yet it si¬ 
multaneously also makes searching, organizing and present¬ 
ing the data and findings more difficult. The cameraphone 
has enabled people to capture more photos and videos than 
they can effectively organize. One challenge for the future is 
therefore devising algorithms that are able to automatically 
and dynamically create albums for use on personal comput¬ 
ers, cloud storage or mobile devices, where desired media 
and moments of importance can be easily surfaced based 
on simple queries. Harnessing the spatiotemporal context at 
capture time and at query time will take a central role in 
successfully addressing this challenge. 

The previous challenge speaks towards social computing 
efforts aimed at understanding events in unstructured data 
through the reification of the photo with space and time.
While GPS-enabled devices are capable of embedding the
precise time, location and orientation of capture in the
metadata of a photo, in many instances this information is
unavailable or inaccurate, being off by seconds, hours or sometimes even
months. In addition, people frequently forget to adjust the 
camera clock to the correct timezone when traveling. These 
kinds of issues pose problems for the accuracy of any kind of 
spatiotemporal data analysis, and new challenges in
computational photography therefore include devising algorithms
that can either fix the erroneous metadata or are resilient against it.

Digital Culture and Preservation. What we know to be 
UGC has grown from simple video uploads and bulletin 
board systems; life online has come to reflect culture. These 
large online collections tell a larger story about the world 
around us, from consumer reviews that speak to how
people engage with the spaces around them to 500 years
of scanned book photos and illustrations that describe
how concepts and objects have been visually depicted over
time. Beyond archived collections, the photostreams of
individuals represent many facets of recorded visual
information, from remembering moments and storytelling to social
communication and self-identity. This presents a grand
challenge of sensemaking and understanding digital archives 
from non-homogeneous sources. Photographers and curators 
alike have contributed to the larger collection of Creative 
Commons images, yet little is known about how such archives
will be navigated and retrieved, or how new information can 
be discovered therein. Our dataset offers avenues of com¬ 
puter science research in multimedia, information retrieval, 
and data visualization in addition to the larger questions of 
digital libraries and preservation. 

3. CONCLUSIONS 

Data is one of the core components of research and develop¬ 
ment. In the field of multimedia, datasets are usually created 
with a single purpose in mind, and as such lack reusability. 
Additionally, datasets are generally not, or may not be, freely
shared with others, and as such lack reproducibility, trans¬ 
parency and accountability. To overcome this conundrum, 
we have released one of the largest datasets ever created, 
consisting of 100 million media objects published under a 
Creative Commons license. Our dataset has been curated to 
be comprehensive and representative of real world photog¬ 
raphy, expansive and expandable in coverage, free and legal 
to use, and intends to consolidate and supplant many ex¬ 
isting collections. Our dataset encourages the improvement 
and validation of research methods, reduces the effort of ac¬ 
quiring data, and stimulates innovation and potential new 
data uses. We have further provided rules on how the data¬ 
set should be used to comply with licensing, attribution and 
copyright, and offered guidelines on how to maximize com¬ 
patibility and promote reproducibility of experiments with 
existing and future work. 